Lecture 6: tidyr and pivoting

Tidy data

“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.”

Wickham (2014)

Tidy data

In a rectangular, tidy data set:

  • Each variable forms a column
  • Each observation forms a row

Example

Variable 1 Variable 2 Variable 3
a 1 x
b 2 y
c 3 z

Why tidy?

  • Tidy data makes it easy to visualize, aggregate and model data.

How to make data tidy?

library(exscidata) 
library(tidyverse)

cyclingstudy %>%
        select(subject, timepoint, lac.125:lac.375) %>%
        print(n = 4)
# A tibble: 80 × 11
  subject timepoint lac.125 lac.175 lac.225 lac.250 lac.275 lac.300 lac.325
    <dbl> <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1       1 pre          1.5     1.86    2.38    3.54    6.21   NA      NA   
2       2 pre          1.19    1.49    2.34    3.21    5.33   NA      NA   
3       3 pre          1.17    1.52    1.22    1.54    2.04    3.32    4.72
4       4 pre          0.88    0.99    2.13    3.25   NA       6.15   NA   
# ℹ 76 more rows
# ℹ 2 more variables: lac.350 <dbl>, lac.375 <dbl>

Pivoting, from wide to long

cyclingstudy %>%
  select(subject, timepoint, 
         lac.125:lac.375) %>%
  pivot_longer(
          values_to = "lactate", 
          names_to = "watt", 
          cols = lac.125:lac.375) %>%
        print()
# A tibble: 720 × 4
   subject timepoint watt    lactate
     <dbl> <chr>     <chr>     <dbl>
 1       1 pre       lac.125    1.5 
 2       1 pre       lac.175    1.86
 3       1 pre       lac.225    2.38
 4       1 pre       lac.250    3.54
 5       1 pre       lac.275    6.21
 6       1 pre       lac.300   NA   
 7       1 pre       lac.325   NA   
 8       1 pre       lac.350   NA   
 9       1 pre       lac.375   NA   
10       2 pre       lac.125    1.19
# ℹ 710 more rows

Pivoting, from wide to long

cyclingstudy %>%
  select(subject, timepoint, 
         lac.125:lac.375) %>%
  pivot_longer(
          values_to = "lactate", 
          names_to = "watt", 
          cols = lac.125:lac.375, 
          names_prefix = "lac.") %>%
        print()
# A tibble: 720 × 4
   subject timepoint watt  lactate
     <dbl> <chr>     <chr>   <dbl>
 1       1 pre       125      1.5 
 2       1 pre       175      1.86
 3       1 pre       225      2.38
 4       1 pre       250      3.54
 5       1 pre       275      6.21
 6       1 pre       300     NA   
 7       1 pre       325     NA   
 8       1 pre       350     NA   
 9       1 pre       375     NA   
10       2 pre       125      1.19
# ℹ 710 more rows

Pivoting, from wide to long

cyclingstudy %>%
  select(subject, timepoint, 
         lac.125:lac.375) %>%
  pivot_longer(
          values_to = "lactate", 
          names_to = "watt", 
          cols = lac.125:lac.375, 
          names_prefix = "lac.",
          names_transform = list(watt = as.numeric)) %>%
        print()
# A tibble: 720 × 4
   subject timepoint  watt lactate
     <dbl> <chr>     <dbl>   <dbl>
 1       1 pre         125    1.5 
 2       1 pre         175    1.86
 3       1 pre         225    2.38
 4       1 pre         250    3.54
 5       1 pre         275    6.21
 6       1 pre         300   NA   
 7       1 pre         325   NA   
 8       1 pre         350   NA   
 9       1 pre         375   NA   
10       2 pre         125    1.19
# ℹ 710 more rows

From long to wide

cyclingstudy %>%
  select(subject, timepoint, 
         VO2.max) %>%
  pivot_wider(names_from = timepoint, 
              values_from = VO2.max) %>%
        print()
cyclingstudy %>%
  select(subject, timepoint, 
         VO2.max) %>%
  pivot_wider(names_from = timepoint, 
              values_from = VO2.max) %>%
        print()
# A tibble: 20 × 5
   subject   pre meso1 meso2 meso3
     <dbl> <dbl> <dbl> <dbl> <dbl>
 1       1 5629   5209  5485  5330
 2       2 4471   4792  4662  4886
 3       3 5598.  5698  5920  6380
 4       4 4944.  4782  4877  5171
 5       5 5748   5694    NA    NA
 6       6 4633.  4706  4695  4864
 7       7 4250   4562  4165  4402
 8       8 4760.  5093  5266  5172
 9       9 4629   5247    NA  4864
10      10 5226   5360  5666  5755
11      11 4063   4149  3812  4077
12      13 4981.  4874    NA  5045
13      14 4444   4360  4316  4636
14      15 4486   4780  4637  4991
15      16 4604   4866  4904  5087
16      17 4366   4520  4277  4483
17      18 5135.  4802  5445  5315
18      19 4018.  4019  3447  4404
19      20 4739   5125  4899  5119
20      21 4753.  4436  4228  4730

Summary

  • Tidy data makes e.g. visualisation, aggregation and modelling easy
  • To reshape data from a wide format to a (tidy) long format we use pivot_longer from the tidyr package
  • The reverse operation is performed with pivot_wider

References

Wickham, Hadley. 2014. “Tidy Data.” Journal Article. Journal of Statistical Software; Vol 1, Issue 10 (2014). https://www.jstatsoft.org/v059/i10.